Abstract



There exists a large variety of machine learning algorithms; as most of these can be configured via
hyper-parameters, there is a staggeringly large number of possible alternatives overall. There has been a
considerable amount of previous work on choosing among learning algorithms and, separately, on optimizing
hyper-parameters (mostly when these are continuous and very few in number) in a given use context.
However, we are aware of no work that addresses both problems together. Here, we demonstrate the
feasibility of using a fully automated approach for choosing both a learning algorithm and its
hyper-parameters, leveraging recent innovations in Bayesian optimization. Specifically, we apply this approach
to the full range of classifiers implemented in WEKA, spanning 3 ensemble methods, 14 meta-methods,
30 base classifiers, and a wide range of hyper-parameter settings for each of these. On 10 popular
data sets from the UCI repository, we show classification performance at least as good as, and on 9 of them
better than, that of complete cross-validation over the default hyper-parameter settings of our 47
classification algorithms. We believe that our approach, which we dubbed Auto-WEKA, will enable typical
users of machine learning algorithms to make better choices and thus to obtain better performance in a
fully automated fashion.



1 Introduction 



Many users of machine learning tools are non-experts, who require off-the-shelf solutions to the problems 
they are tackling. The machine learning community has much aided these users by making available a 
wide variety of sophisticated learning algorithms through open source packages, such as WEKA [1] and
PyBrain [2]. What remains is the considerable challenge of choosing the right learning algorithm for a given
problem; furthermore, most algorithms have so-called hyper-parameters that need to be set to suitable values
to obtain good performance in a given use context. The difficulty of this challenge and the amount of effort
required to evaluate alternatives encourage users simply to choose an algorithm based on its reputation or
intuitive appeal, and to leave hyper-parameters set to their default values. Unfortunately, this approach can
lead to performance much below that of the best method and hyper-parameter settings.

We propose that users be given automated tools for choosing both a learning algorithm and its
hyper-parameters. Currently, we are not aware of any such general tools. Likely explanations include the high
dimensionality of the combined space of learning algorithms and their hyper-parameters, the fact that these
dimensions involve categorical and continuous choices, and the notorious difficulty of stochastic
optimization problems generally. However, research in high-dimensional stochastic optimization - and in
particular, Bayesian optimization - has made great strides in the past decade, and produced methods that can
provide the basis for a general tool for choosing learning algorithms and their hyper-parameters. Such a tool
would offer obvious benefits to novice users, but would also be helpful to machine learning experts
confounded by large design spaces, seeking to expose additional hyper-parameters, or interested in learning
the strengths and weaknesses of different algorithms.

Although the methods we describe are general, for concreteness, our work focuses on classification
problems: learning a function $f : \mathcal{X} \to \mathcal{Y}$ with finite $\mathcal{Y}$. A learning algorithm $A$ maps a set
$\{d_1, \dots, d_n\}$ of training data points $d_i = (x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$ to such a function, which is often
expressed via a vector of model parameters. Many learning algorithms $A$ further expose hyper-parameters
$\lambda \in \Lambda$, which change the way the learning algorithm $A_\lambda$ itself works. (For example, a hyper-parameter
might describe a description-length penalty, the number of neurons in a hidden layer, the number of data
points that a leaf in a decision tree must contain to be eligible for splitting, etc.) These hyper-parameters
are typically optimized in an "outer loop" which evaluates the performance of each hyper-parameter
configuration using cross-validation. In this context, it is common to use a grid search that effectively
reduces $\Lambda$ to a manageable size by limiting each hyper-parameter to a discrete set of values, evaluates each
grid point via cross-validation, and returns the hyper-parameter configuration that achieves the smallest
cross-validation error. However, this approach is feasible only when the dimensionality of the
hyper-parameter space is very low, and even then requires domain knowledge to achieve a reasonable
discretization.
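
The grid search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_cv_error` is a hypothetical stand-in for a real cross-validation error, and the hyper-parameter names `penalty` and `min_leaf` are invented for the example. The helper `grid_size` shows why the approach breaks down as dimensionality grows.

```python
import itertools

def grid_search(loss, grid):
    """Evaluate every grid point and return (best_config, best_loss)."""
    names = sorted(grid)
    best_config, best_loss = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        value = loss(config)
        if value < best_loss:
            best_config, best_loss = config, value
    return best_config, best_loss

def grid_size(grid):
    """Number of grid points: the product of the per-dimension value counts."""
    size = 1
    for values in grid.values():
        size *= len(values)
    return size

# Hypothetical stand-in for a cross-validation error (not from the paper).
def toy_cv_error(config):
    return (config["penalty"] - 0.1) ** 2 + (config["min_leaf"] - 5) ** 2 / 100.0

# Hypothetical discretization chosen by hand, as grid search requires.
toy_grid = {"penalty": [0.001, 0.01, 0.1, 1.0], "min_leaf": [1, 2, 5, 10, 20]}
```

With only two hyper-parameters the grid has 20 points, but ten values in each of six dimensions would already require a million cross-validation runs.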

It has been demonstrated recently that random search is almost always preferable to grid search and often
produces high quality results that can rival those produced by an expert with manual tuning [3]. The machine
learning community has also developed more sophisticated methods for hyper-parameter optimization.
Hoeffding races [4] reduce the total effort spent by iteratively increasing the number of cross-validation folds
to consider for each learning algorithm, estimating their loss function and discarding learning algorithms
that appear significantly worse than others. Another recent line of work follows a Bayesian optimization [5]
approach, also known as sequential model-based optimization [6]; it iteratively constructs a model of how
the loss function depends on the hyper-parameters, uses this model to select the next set of hyper-parameters
to evaluate, and incorporates the resulting loss value into the model [7].

When multiple learning algorithms are considered, the simplest approach for determining which method 
and hyper-parameters to use is simply to employ one of the optimization techniques just discussed for 
each learning algorithm under consideration. However, this method makes inefficient use of computational 
resources, and becomes quickly impractical as the number of candidate algorithms grows. One alternative 
solution is to provide the user with a ranking of suggested algorithms based on data set characteristics [8].
However, this approach does not avert the need to optimize hyper-parameters and to select among top- 
ranked learning algorithms. 

In what follows, we argue that learning algorithm selection and hyper-parameter optimization should be
treated as a single combined problem, and demonstrate that by running recent Bayesian optimization
methods, high quality results can be obtained with minimal human effort. We first describe some of these
optimization methods (Section 2). We then define a concrete algorithm selection and hyper-parameter
optimization problem encompassing the full range of classifiers in the open source package WEKA (Section 3),
and show that we can improve over the performance obtained by the default configuration of the best
classifier for a given problem (Section 4) as determined by cross-validation and Hoeffding races [4]; in
particular, the sequential model-based optimization procedure SMAC achieves higher test set accuracies on
9 out of 10 well-known data sets and ties on the 10th.

2 Combined method and hyper-parameter selection



Many learning algorithms have hyper-parameters that are only active if other parameters are instantiated to 
certain choices. For example, the two parameters of a Support Vector Machine's polynomial kernel are not 
relevant if we use a different kernel instead (e.g., the RBF or Pearson VII function-based universal kernel). 

Following [5], we say that a hyper-parameter $\lambda_i$ is conditional on another hyper-parameter $\lambda_j$ if $\lambda_i$ is only
active when hyper-parameter $\lambda_j$ takes values from a given set $V_i(j) \subset \Lambda_j$, and we call $\lambda_j$ the parent of $\lambda_i$.
Conditional hyper-parameters can in turn be parents of other conditional hyper-parameters, giving rise to a
tree-structured space [7] or, in some cases, a directed acyclic graph (DAG) [9].

We model the problem of selecting one of $k$ learning methods $A_1, \dots, A_k$ with associated hyper-parameter
spaces $\Lambda^{(1)}, \dots, \Lambda^{(k)}$ as a single combined hyper-parameter optimization problem with algorithm $A$ and
parameter space $\Lambda$. This combined problem features the union of the parameters and their domains in
$\Lambda^{(1)}, \dots, \Lambda^{(k)}$, plus a new root-level hyper-parameter $\lambda_r \in \{A_1, \dots, A_k\}$ that selects between the $k$
methods. The root-level parameters of each subspace $\Lambda^{(i)}$ are made conditional on $\lambda_r$ being instantiated to
$A_i$.
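
The construction above can be illustrated with a small Python sketch. Everything concrete here is an assumption made for the example: the two algorithms, their hyper-parameter names and ranges, and the `(domain, condition)` encoding are hypothetical and are not taken from Auto-WEKA itself. The sketch merges per-algorithm spaces under one root-level parameter and samples only the hyper-parameters whose parent condition is satisfied.

```python
import math
import random

# Hypothetical per-algorithm spaces; names and ranges are illustrative only.
SPACES = {
    "svm": {"C": ("log_uniform", 1e-3, 1e3), "kernel": ("choice", ["rbf", "poly"])},
    "random_forest": {"num_trees": ("int", 10, 200)},
}

def combined_space(spaces):
    """Union of all sub-spaces plus a root-level 'algorithm' parameter.

    Each entry maps a name to (domain, condition); condition is None for the
    root, or (parent_name, allowed_parent_values) for conditional parameters.
    """
    space = {"algorithm": (("choice", sorted(spaces)), None)}
    for algo in sorted(spaces):
        for name, domain in spaces[algo].items():
            space[algo + ":" + name] = (domain, ("algorithm", {algo}))
    return space

def sample_domain(domain, rng):
    kind = domain[0]
    if kind == "choice":
        return rng.choice(domain[1])
    if kind == "int":
        return rng.randint(domain[1], domain[2])
    if kind == "log_uniform":
        return math.exp(rng.uniform(math.log(domain[1]), math.log(domain[2])))
    raise ValueError("unknown domain kind: %r" % (kind,))

def sample_config(space, rng):
    """Sample the root first, then only the parameters that are active."""
    config = {}
    # Parents were inserted before their children; dicts keep insertion order.
    for name, (domain, condition) in space.items():
        if condition is None or config.get(condition[0]) in condition[1]:
            config[name] = sample_domain(domain, rng)
    return config
```

A sampled configuration therefore contains the root choice and the hyper-parameters of exactly one algorithm; all other hyper-parameters are inactive.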

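The goal of hyper-parameter optimization is to determine the hyper-parameters $\lambda^*$ optimizing
generalization performance of $A_{\lambda^*}$ based on a limited amount of training data
$\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$. Generalization performance is approximated by splitting $\mathcal{D}$ into disjoint
training and validation sets $\mathcal{D}_{\text{train}}^{(i)}$ and $\mathcal{D}_{\text{valid}}^{(i)}$, learning functions $f_i$ by applying $A_\lambda$ to
$\mathcal{D}_{\text{train}}^{(i)}$ and evaluating the predictive performance of these functions on $\mathcal{D}_{\text{valid}}^{(i)}$. This allows for
the hyper-parameter optimization problem to be written as:

$$c(\lambda) = \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\left(A_\lambda, \mathcal{D}_{\text{train}}^{(i)}, \mathcal{D}_{\text{valid}}^{(i)}\right)$$

$$\lambda^* \in \operatorname*{argmin}_{\lambda \in \Lambda} c(\lambda)$$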
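
As a concrete reading of this objective, the following Python sketch computes the mean validation loss of one configuration over a list of training/validation pairs. The one-dimensional k-nearest-neighbour learner and the misclassification loss are hypothetical placeholders chosen so the example is self-contained; any learner with hyper-parameters could be substituted.

```python
def mean_validation_loss(learn, loss, config, splits):
    """c(lambda): average loss of the configured learner over all splits."""
    total = 0.0
    for train, valid in splits:
        model = learn(config, train)   # apply A_lambda to the training part
        total += loss(model, valid)    # evaluate on the validation part
    return total / len(splits)

# Hypothetical learner: k-nearest neighbours on 1-D inputs, hyper-parameter k.
def knn_learn(config, train):
    k = config["k"]
    def predict(x):
        nearest = sorted(train, key=lambda d: abs(d[0] - x))[:k]
        votes = [label for _, label in nearest]
        return max(set(votes), key=votes.count)
    return predict

def misclassification_rate(model, valid):
    return sum(model(x) != y for x, y in valid) / len(valid)
```

Hyper-parameter optimization then amounts to searching for the `config` that minimizes `mean_validation_loss`.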



There are a number of ways that the training data can be broken up into training/validation pairs
$(\mathcal{D}_{\text{train}}^{(i)}, \mathcal{D}_{\text{valid}}^{(i)})$. The most common of these is $k$-fold cross-validation, which splits the training
data into $k$ equal-sized partitions $\mathcal{D}_{\text{valid}}^{(1)}, \dots, \mathcal{D}_{\text{valid}}^{(k)}$, and sets
$\mathcal{D}_{\text{train}}^{(i)} = \mathcal{D} \setminus \mathcal{D}_{\text{valid}}^{(i)}$ for $i = 1, \dots, k$. It is important to note that as $k$ grows the size of the
validation set shrinks such that the variance in the loss function for each individual fold
$\mathcal{L}(A_\lambda, \mathcal{D}_{\text{train}}^{(i)}, \mathcal{D}_{\text{valid}}^{(i)})$ increases. One method for counteracting this is the technique of Repeated
Random Sub-Sampling Validation (RRSSV), which simply splits $\mathcal{D}$ into random partitions
$\{\mathcal{D}_{\text{train}}^{(i)}, \mathcal{D}_{\text{valid}}^{(i)}\}$ with a fixed size of both $\mathcal{D}_{\text{train}}^{(i)}$ and $\mathcal{D}_{\text{valid}}^{(i)}$. This technique allows for an
almost arbitrary number of balanced splits, the loss contributions of each of which have relatively low
variance.¹
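
The two splitting schemes can be contrasted with a short Python sketch. The function names and the interleaved fold assignment are assumptions made for illustration; the point is only that k-fold cross-validation yields exactly k disjoint validation sets, while repeated random sub-sampling can produce any number of splits of a fixed size.

```python
import random

def k_fold_splits(data, k):
    """k (train, valid) pairs whose validation parts partition the data."""
    folds = [data[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        splits.append((train, folds[i]))
    return splits

def rrssv_splits(data, num_splits, train_fraction, rng):
    """Any number of random (train, valid) pairs with fixed sizes."""
    n_train = int(round(train_fraction * len(data)))
    splits = []
    for _ in range(num_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)
        splits.append((shuffled[:n_train], shuffled[n_train:]))
    return splits
```

For example, with 20 data points, 10-fold cross-validation gives validation sets of size 2, whereas `rrssv_splits(data, 50, 0.7, rng)` gives 50 splits with validation sets of size 6 each.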

In principle, the optimization problem above can be tackled in various ways. A particularly promising
approach is Bayesian Optimization [5], and in particular Sequential Model-Based Optimization (SMBO [6]).
SMBO (outlined in Algorithm 1) first builds a model $\mathcal{M}_{\mathcal{L}}$ that captures the loss function with respect to
hyper-parameter settings. It then uses $\mathcal{M}_{\mathcal{L}}$ to obtain a candidate configuration of hyper-parameters $\lambda_c$;
this can, e.g., be done by maximizing the expected positive improvement (EI, computed based on
$\mathcal{M}_{\mathcal{L}}$'s predictive distribution) over a performance threshold. Next, SMBO evaluates the loss of $\lambda_c$,
updates $\mathcal{M}_{\mathcal{L}}$ with the new data point obtained, and the process repeats. Recall that the function $c(\lambda)$ to
be optimized is a mean over a number of misclassification rates, each corresponding to one pair of
$\mathcal{D}_{\text{train}}^{(i)}$ and $\mathcal{D}_{\text{valid}}^{(i)}$ constructed from the training set. SMBO allows for evaluating single
components of that sum at a time.

In the following sections we review two specific SMBO algorithms that are suitable to the task of model 
and hyper-parameter selection, and describe two baseline methods. 

1 However, since the folds are correlated, there are diminishing returns from additional splits.
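
Algorithm 1 SMBO

1: $\mathcal{H} \leftarrow \emptyset$, initialise model $\mathcal{M}_{\mathcal{L}}$
2: while time budget for optimization has not been exhausted do
3:     $\lambda^* \leftarrow$ candidate configuration from $\mathcal{M}_{\mathcal{L}}$
4:     Compute $c = \mathcal{L}(A_{\lambda^*}, \mathcal{D}_{\text{train}}^{(i)}, \mathcal{D}_{\text{valid}}^{(i)})$
5:     $\mathcal{H} \leftarrow \mathcal{H} \cup \{(c, \lambda^*)\}$
6:     Update $\mathcal{M}_{\mathcal{L}}$ given $\mathcal{H}$
7: end while
8: return Best from $\mathcal{H}$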
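
The loop structure of Algorithm 1 can be written as a short Python skeleton. This is a sketch under loud assumptions: a fixed number of evaluations stands in for the time budget, and the "model" used in the example (`fit_incumbent`, `propose_near_incumbent`) is a deliberately crude placeholder that just remembers the best configuration seen so far; it is not the random forest model of SMAC or the density model of TPE.

```python
import random

def smbo(evaluate, propose, fit_model, budget, rng):
    """Generic SMBO loop; returns the best (loss, config) pair found."""
    history = []                       # H: observed (loss, config) pairs
    model = fit_model(history)         # initialise the model
    for _ in range(budget):            # evaluation count replaces a time budget
        config = propose(model, rng)   # candidate configuration from the model
        loss = evaluate(config)        # loss of the candidate
        history.append((loss, config))
        model = fit_model(history)     # update the model given H
    return min(history, key=lambda pair: pair[0])

# Placeholder "model": the incumbent configuration (hypothetical, illustration only).
def fit_incumbent(history):
    return min(history, key=lambda pair: pair[0])[1] if history else None

def propose_near_incumbent(model, rng):
    # Half of the proposals are uniformly random, half perturb the incumbent.
    if model is None or rng.random() < 0.5:
        return rng.uniform(0.0, 1.0)
    return min(1.0, max(0.0, model + rng.gauss(0.0, 0.05)))
```

Swapping in a real surrogate model only requires replacing `fit_model` and `propose`; the outer loop stays the same.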



2.1 Sequential Model-based Algorithm Configuration (SMAC) 

Sequential model-based algorithm configuration (SMAC [6]) models $c(\lambda)$ directly as $p(c \mid \lambda)$. SMAC
supports a variety of models, including approximate Gaussian processes and random forests. Here, we use
random forest models, since they are known to perform well for discrete and high-dimensional input data.
While random forests are not usually treated as probabilistic models, SMAC obtains a predictive mean $\mu_\lambda$
and variance $\sigma_\lambda^2$ of $p(c \mid \lambda)$ as frequentist estimates over the predictions of the individual trees for $\lambda$; it
then models $p(c \mid \lambda)$ as a Gaussian $\mathcal{N}(\mu_\lambda, \sigma_\lambda^2)$.

It then uses the standard EI criterion measuring expected improvement over the best performance $f_{\min}$
known so far. This can be computed by the closed-form expression

$$E[I(\lambda)] = \sigma_\lambda \cdot [u \cdot \Phi(u) + \varphi(u)],$$

where $u = \frac{f_{\min} - \mu_\lambda}{\sigma_\lambda}$, and $\varphi$ and $\Phi$ denote the probability density function and cumulative
distribution function of a standard normal distribution, respectively [10]. SMAC also supports conditional
parameters by simply instantiating inactive conditional parameters to default values for model training and
prediction.
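
The closed-form EI expression is easy to evaluate numerically. The Python sketch below follows the formula as stated, using only the standard library; the helper `forest_mean_and_std` is an assumption about how per-tree predictions might be summarized (empirical mean and population standard deviation) and is not taken from the SMAC implementation.

```python
import math

def normal_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def normal_cdf(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def expected_improvement(mu, sigma, f_min):
    """E[I] = sigma * (u * Phi(u) + phi(u)) with u = (f_min - mu) / sigma."""
    if sigma <= 0.0:
        # Degenerate case: no predictive uncertainty.
        return max(f_min - mu, 0.0)
    u = (f_min - mu) / sigma
    return sigma * (u * normal_cdf(u) + normal_pdf(u))

def forest_mean_and_std(tree_predictions):
    """Empirical mean and standard deviation over individual tree predictions."""
    n = len(tree_predictions)
    mu = sum(tree_predictions) / n
    variance = sum((p - mu) ** 2 for p in tree_predictions) / n
    return mu, math.sqrt(variance)
```

Note that EI is positive even when the predicted mean equals $f_{\min}$, because the predictive uncertainty leaves room for improvement; it grows both as the mean decreases and as the variance increases.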

SMAC is designed for robust optimization under noisy function evaluations, and as such implements special
mechanisms to keep track of its best known configuration and ensure high confidence in its performance.
This robustness against noisy function evaluations can be exploited in hyper-parameter optimization by
evaluating a single training/validation fold at a time - this yields much faster, yet noisier, function
evaluations. For its incumbent configuration, SMAC executes additional function evaluations with different
folds up to the total number of available folds, but poorly performing configurations can be discarded based
on fewer function evaluations.

Finally, SMAC also implements a safeguard mechanism to achieve robust performance even when its model
is misled. This is achieved by choosing every second configuration to be evaluated at random. Note that the
overhead of this safeguard is limited, since poorly performing configurations are only evaluated for a single
fold due to the mechanism above.



2.2 Tree-structured Parzen Estimator (TPE) 

The Tree-structured Parzen Estimator (TPE [7]) is an SMBO algorithm that uses the formulation of EI in (1),
which requires a threshold value $c^*$. While SMAC models $p(c \mid \lambda)$ explicitly, TPE instead models both
$p(c)$ and $p(\lambda \mid c)$.

$$E[I_{c^*}(\lambda)] := \int_{-\infty}^{\infty} \max(c^* - c, 0) \, p_{\mathcal{M}_{\mathcal{L}}}(c \mid \lambda) \, dc \qquad (1)$$

To model $p(\lambda \mid c)$, the observation history is divided into two pieces based on each configuration's loss
evaluation compared against $c^*$, which is the $\gamma$-quantile of the losses in $\mathcal{H}$. All $(c, \lambda) \in \mathcal{H}$ where $c < c^*$
are used to form the density estimate $\ell(\lambda)$, while all $(c, \lambda) \in \mathcal{H}$ where $c \geq c^*$ are used to form the density
estimate $g(\lambda)$. Intuitively, this creates a probabilistic estimator for hyper-parameters that appear to do
'good', and a different estimator for hyper-parameters that appear 'poor' with respect to the threshold. It is
shown in [7] that $E[I_{c^*}(\lambda)] \propto \left(\gamma + \frac{g(\lambda)}{\ell(\lambda)}(1 - \gamma)\right)^{-1}$. This expression can be maximized by first
generating many candidate hyper-parameter configurations at random, and then picking $\lambda^*$ with the
smallest value of $g(\lambda^*)/\ell(\lambda^*)$.

Both $\ell(\lambda)$ and $g(\lambda)$ are constructed in the same way. A tree is generated that captures the dependence
between hyper-parameters, with an additional root node that is a parent to all hyper-parameters that are not
conditional on any other. For each node in the tree, a 1-D Parzen estimator is created to model the
density of the node's corresponding hyper-parameter. The algorithm adds points to each estimator by starting
at the root of the tree, then descending to nodes where the conditionality of the hyper-parameter is
satisfied. For each visited node, a sample is placed in the 1-D Parzen estimator corresponding to the node's
hyper-parameter. To evaluate the probability estimate of a candidate configuration $\lambda$, this same traversal is
repeated, but now at each node $p(\lambda_i)$ is computed, and combined to generate a single probability once the
traversal completes. Note that this means that TPE assumes independence for hyper-parameters that are not
connected on a path to the root of the tree.
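
The split at the $\gamma$-quantile and the selection by the ratio $g/\ell$ can be sketched for the simplest case of a single, unconditional, continuous hyper-parameter. The Python code below is an illustration only: the Gaussian kernel, the fixed bandwidth, the uniform candidate sampling and the index-based quantile are all assumptions of this sketch, and the tree traversal for conditional hyper-parameters is omitted.

```python
import math
import random

def parzen_density(samples, bandwidth):
    """1-D Parzen (kernel density) estimator with a Gaussian kernel."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm
    return density

def tpe_suggest(history, gamma, low, high, num_candidates, bandwidth, rng):
    """Suggest the candidate minimizing g(x) / l(x).

    history is a list of (loss, x) pairs; the threshold c_star is taken as
    (approximately) the gamma-quantile of the observed losses.
    """
    losses = sorted(c for c, _ in history)
    c_star = losses[int(gamma * len(losses))]
    good = [x for c, x in history if c < c_star]
    poor = [x for c, x in history if c >= c_star]
    if not good or not poor:
        raise ValueError("need observations on both sides of the threshold")
    ell = parzen_density(good, bandwidth)
    g = parzen_density(poor, bandwidth)
    candidates = [rng.uniform(low, high) for _ in range(num_candidates)]
    # Guard against division by zero far away from all 'good' samples.
    return min(candidates, key=lambda x: g(x) / max(ell(x), 1e-300))
```

Because $\ell$ is built from the configurations with the lowest losses, the suggested point tends to lie where good observations are dense and poor ones are sparse.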

2.3 Exhaustive evaluation and Hoeffding races on default hyper-parameter configurations

As a baseline for our evaluation, we also considered two algorithm selection methods that do not perform
hyper-parameter optimization. The first of these evaluates the loss function at every pair of $\mathcal{D}_{\text{train}}^{(i)}$ and
$\mathcal{D}_{\text{valid}}^{(i)}$ using default hyper-parameters for each learning algorithm. Once this has been completed, the
classifier with the lowest average loss is selected to be run on the testing set. We call this procedure
Exhaustive Default (Ex-Def).

The other method we use is a Hoeffding race (H-Race) [4]. An H-Race begins by having all given
learning algorithms 'participate' in the race. It then iterates over the set of training/validation splits, evaluat-
ing the loss function for all participating algorithms. At the end of each iteration, a statistical test (based on 
Hoeffding's formula) is applied to determine if any of the participants should be removed. In each iteration, 
a confidence interval is computed for the estimated mean performance. Any participant in the race that has 
a best confidence estimate that is inferior to the worst case estimate of the race leader's mean is removed 
from the race. The idea behind this method is that algorithms that have obviously poor performance are 
dropped early in the race, so the computational effort that would have been expended on them by exhaustive 
search can instead be put to use to evaluate stronger algorithms in more detail. 
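
A Hoeffding race of this kind can be sketched as follows. The sketch assumes losses bounded in [0, 1] and uses the standard Hoeffding confidence radius $\sqrt{\ln(2/\delta)/(2n)}$ after $n$ evaluations; the function names, the default $\delta$ and the `evaluate(algorithm, split)` interface are assumptions of this illustration rather than details given in the paper.

```python
import math

def hoeffding_radius(n, delta, loss_range=1.0):
    """Half-width of the confidence interval for a mean of n bounded losses."""
    return loss_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def hoeffding_race(evaluate, algorithms, splits, delta=0.05):
    """Race algorithms over splits; returns (winner, sorted survivors)."""
    losses = {a: [] for a in algorithms}
    for split in splits:
        for a in losses:
            losses[a].append(evaluate(a, split))
        n = len(next(iter(losses.values())))
        radius = hoeffding_radius(n, delta)
        means = {a: sum(v) / n for a, v in losses.items()}
        leader = min(means, key=means.get)
        # Keep a participant only if its best-case estimate is not worse
        # than the leader's worst-case estimate.
        losses = {a: v for a, v in losses.items()
                  if means[a] - radius <= means[leader] + radius}
    means = {a: sum(v) / len(v) for a, v in losses.items()}
    return min(means, key=means.get), sorted(losses)
```

An algorithm whose mean loss is far above the leader's is dropped after a handful of splits, while close competitors stay in the race until the confidence intervals separate.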



3 Auto-WEKA 

To demonstrate the feasibility of an automatic approach to learning algorithm and hyper-parameter
selection, we built a tool that solves this problem for models implemented in the WEKA package [1]. The
results we present here only consider classification problems, but the same approach would also work in
other settings.

Table 1 provides a list of all 47 WEKA classification algorithms that are able to handle both numeric and
categorical instance attributes. Of these models, 30 are considered 'base' classifiers (which can be used by 
themselves), 14 of the remaining classifiers are meta methods (which take a single base classifier and its 
parameters as an input), while the remaining 3 ensemble classifiers can take any number of base classifiers 
as input. We allowed the meta methods to use any base classifier with any hyper-parameter settings, and 
allowed ensemble methods up to five base classifiers. 

The algorithms in Table 1 have a wide variety of hyper-parameters that take on values from continuous
intervals, ranges of integers, and discrete sets. We associated either a uniform or log uniform prior

Table 1: Classifiers in Auto-WEKA: Classifiers marked with * are meta methods, which take in addition
to their own parameters one 'base' classifier and its parameters. Classifiers marked with + are ensemble
methods that take as input up to 5 'base' classifiers and their parameters.

Name                          | Disc. | Cont. | Name                            | Disc. | Cont.
------------------------------+-------+-------+---------------------------------+-------+------
Bayes Net                     | 2     |       | Logistic Model Tree             | 7     | 1
Naive Bayes                   | 2     |       | NB Tree                         |       |
Logistic Regression           |       | 1     | Random Forest                   | 5     |
Single Layer Perceptron       | 5     | 2     | Random Tree                     | 8     |
RBF Network                   | 2     | 1     | REP Tree                        | 4     | 1
SVM                           | 4     | 4     | Cost-Complexity Pruning Tree    | 4     |
KNN                           | 5     |       | Adaboost M1*                    | 3     |
K*                            | 3     |       | Additive Regression*            | 2     | 1
Hyper Pipes                   |       |       | Bagging*                        | 2     |
Voting Feature Intervals      | 1     | 1     | Dagging*                        | 1     |
Conjunctive Rule              | 4     | 1     | Classification Via Regression*  |       |
Decision Table                | 4     |       | Decorate*                       | 3     | 1
DT/NB Hybrid                  | 3     |       | Ensemble of Nested Dichotomies* | 2     |
RIPPER                        | 3     | 1     | LogitBoost*                     | 8     | 1
NN using Generalized Examples | 2     | 1     | MultiBoost Adaboost*            | 5     |
1R                            | 1     |       | MultiClass Classifier*          | 2     | 1
PART                          | 4     |       | Random Committee*               | 1     |
Ripple Down                   | 3     | 1     | Random Subspace*                | 2     |
AD Tree                       | 2     |       | Random Subspace*                | 2     |
BF Tree                       | 6     |       | Threshold Select*               | 2     |
Decision Stump                |       |       | Voting+                         | 1     |
Functional Tree               | 5     | 1     | Grading+                        | 1     |
C4.5 Decision Tree            | 3     | 1     | Stacking+                       | 1     |
LogitBoost AD Tree            | 1     |       |                                 |       |


with each numerical parameter, depending on its meaning. For example, we set a log uniform prior for the
ridge regression penalty, and a uniform prior for the maximum depth for a tree in a random forest. If we
were to discretize the intervals and ranges such that there were at most 10 values each could take, there
would be over $10^{25}$ different possible parameter settings.

Auto-WEKA can be understood as a single learning algorithm with one top-level Boolean parameter,
is_base, that selects among single base classifiers and meta or ensemble classifiers. If is_base is true,
then the parameter class determines which of the 30 base classifiers is to be used. If is_base is false,
then class indicates either an ensemble or a meta classifier. If class is a meta classifier, then the
parameter base_class is chosen to be one of the 30 base classifiers. In the event that class is an ensemble
classifier, an additional parameter num_classes is an integer chosen from 1-5. The class_i variables are then
selected based on the value of num_classes, and each again selects which of the 30 base classifiers to use.
For each *class parameter, conditional hyper-parameters for every model are attached. This results in a very
wide tree that captures the hierarchical nature of the model hyper-parameters, and allows the creation of
a single function that SMBO can be applied to.
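
The top levels of this parameter tree can be mimicked with a small Python sketch. The classifier names below are placeholders (`base_0`, `meta_0`, and so on) rather than actual WEKA class names, the sampling probabilities are arbitrary, and the per-classifier hyper-parameters attached below each `*class` parameter are left out; only the parameter names `is_base`, `class`, `base_class`, `num_classes` and `class_i` come from the description above.

```python
import random

# Placeholder names standing in for the 30 base, 14 meta and 3 ensemble classifiers.
BASE = ["base_%d" % i for i in range(30)]
META = ["meta_%d" % i for i in range(14)]
ENSEMBLE = ["ensemble_%d" % i for i in range(3)]

def sample_structure(rng):
    """Sample the classifier-level choices of the Auto-WEKA parameter tree."""
    config = {"is_base": rng.random() < 0.5}
    if config["is_base"]:
        config["class"] = rng.choice(BASE)
    else:
        config["class"] = rng.choice(META + ENSEMBLE)
        if config["class"] in META:
            config["base_class"] = rng.choice(BASE)
        else:
            config["num_classes"] = rng.randint(1, 5)
            for i in range(1, config["num_classes"] + 1):
                config["class_%d" % i] = rng.choice(BASE)
    return config

def count_structures():
    """Number of distinct classifier-level choices, counting class_i as ordered."""
    ensembles = len(ENSEMBLE) * sum(len(BASE) ** n for n in range(1, 6))
    return len(BASE) + len(META) * len(BASE) + ensembles
```

Even before any numerical hyper-parameter is considered, the classifier-level choices alone already number in the tens of millions under this counting, which illustrates why exhaustive enumeration is not an option.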



4 Evaluating Auto-WEKA 

In this section, we experimentally study the performance that can be achieved by applying automated
methods and hyper-parameter selection techniques in Auto-WEKA. Auto-WEKA is impartial to the choice of
optimizer, so we studied the performance of all methods described in Section 2: SMAC, TPE, Hoeffding

Table 2: Datasets Used: Num. Discrete and Num. Continuous refer to the number of discrete and continuous
attributes of elements in the dataset.

Name                                       | Num. Discrete | Num. Continuous | Classes | Size
-------------------------------------------+---------------+-----------------+---------+-----
Abalone                                    | 1             | 7               | 28      | 4177
Car Evaluation                             | 6             |                 | 4       | 1728
German Credit Card Data (GCCD)             | 13            | 7               | 2       | 1000
Ionosphere                                 |               | 34              | 2       | 351
Iris                                       |               | 4               | 3       | 150
King-Rook vs King-Pawn (KR-vs-KP)          | 37            |                 | 2       | 3196
Waveform                                   |               | 40              | 3       | 5000
Wine Quality - White                       |               | 11              | 11      | 4898
Breast Cancer Wisconsin - Diagnostic (WBC) |               | 30              | 2       | 569
Yeast                                      |               | 8               | 10      | 1484


races, and exhaustive evaluation of the defaults.²

4.1 Experimental setup 

We evaluated Auto-WEKA on 10 prominent benchmark datasets (see Table 2), all obtained from the UCI
repository [11]. Each dataset was partitioned once into a random 70% train / 30% test split. The test data
was never seen by any optimization method; it was only used in an offline analysis stage to evaluate the 
configurations found by the various optimization methods. 

A completely instantiated optimization method needs to define which training/validation splits it operates 
on. We evaluated hyper-parameter configurations based on standard 10-fold cross-validation for each of 
Ex-Def, Hoeffding races, TPE, and SMAC. Additionally, since SMAC supports incremental evaluations on 
individual terms of a mean objective function, we used it with repeated random sub-sampling validation. 

All of our experiments were run on Linux machines with Intel Xeon X5650 six-core processors, running 
at 2.66GHz. Each experiment had at most 2GB of RAM available for use - if this limit was exceeded, the 
job was terminated. All experimental methods were given 10 000 seconds of CPU time to complete, after 
which they were terminated. When performing a step towards estimating the error, a timeout was set at 400
seconds for the computation to finish.³

Finally, we note that both TPE and SMAC are randomized algorithms and are expected to produce different 
results based on the random seed provided. As demonstrated in [13], this allows for trivial yet effective
parallelization of the optimization process: simply perform k independent runs of the optimization method 
in parallel with the 10 000 second cutoff and select the result of the run with the lowest cross-validation error 
to return. The more parallel runs, the faster this process can be expected to identify excellent configurations; 
however, we restricted Auto-WEKA to run only 4 parallel jobs in order to facilitate runs on a standard 
desktop machine. 

4.2 Results 

In order to determine how effectively Auto-WEKA performed, we computed the test performance. This was 
evaluated by running the chosen learning algorithm and selected hyper-parameters on the entire training set 
$\mathcal{D}$ and then evaluating the learned function on the withheld testing set.

2 We thank the authors of SMAC and TPE for providing their respective implementations. 

3 The particular value of 400 seconds was chosen to allow almost all validations to complete in time. The only timeouts we 
observed were for RBF networks on data set Abalone, and these runs also did not complete within 10 hours. The 10 000 second cutoff 
was arbitrary and was imposed to keep the computational cost of the experiments reasonable. A promising approach for predicting 
runtimes and taking them into account as an integral part of the expected improvement formulation was presented by [12].

Table 3: Performance on testing data. We performed 25 runs of each of Hoeffding races, SMAC, and TPE,
and report results as median (20% quantile, 80% quantile) percent error rate across 1 000 bootstrap samples.
Ex-Def is deterministic. The error rate is determined by training the selected model/hyper-parameters on
the entire 70% of training data, then computing the accuracy on the previously unused 30% of testing data.

Dataset      | Ex-Def-10 (%) | H-Race-10 (%)        | TPE-10 (%)           | SMAC-RRSSV (%)
-------------+---------------+----------------------+----------------------+---------------------
Abalone      | 74.86         | 74.86 (73.50, 74.86) | 73.90 (73.18, 74.38) | 73.42 (73.02, 74.42)
Car          | 0.77          | 0.77 (0.77, 0.77)    | 0.39 (0.00, 0.77)    | 0.39 (0.19, 0.58)
GCCD         | 28.67         | 28.67 (28.67, 28.67) | 28.00 (28.67, 30.33) | 27.67 (26.33, 28.67)
Ionosphere   | 8.57          | 8.57 (8.57, 8.57)    | 8.57 (6.67, 9.52)    | 6.67 (6.67, 7.62)
Iris         | 4.44          | 4.44 (4.44, 0.44)    | 4.44 (2.22, 6.67)    | 2.22 (2.22, 4.44)
KR vs. KP    | 0.73          | 0.73 (0.73, 0.73)    | 0.63 (0.52, 0.63)    | 0.42 (0.31, 1.04)
Waveform     | 14.33         | 14.33 (14.33, 14.33) | 14.60 (14.26, 15.20) | 14.13 (14.13, 14.20)
WBCD         | 3.53          | 3.53 (3.53, 3.53)    | 3.53 (2.94, 4.11)    | 3.53 (2.94, 3.53)
Wine Quality | 35.26         | 35.26 (35.26, 35.26) | 33.97 (33.28, 36.69) | 33.90 (32.13, 35.13)
Yeast        | 41.35         | 41.35 (41.35, 41.35) | 39.78 (38.42, 41.34) | 39.33 (38.42, 40.44)




Figure 1 : Trajectories of training performance over time on two representative datasets, for Hoeffding races, 
TPE, and SMAC, compared to the performance achieved by Ex-Def. 



Table 3 summarizes the results obtained with all methods. In all 10 datasets, TPE and SMAC (the Bayesian
optimization methods) yielded performance better than or equal to Hoeffding races and Ex-Def (which
only perform model selection). Among the Bayesian optimization methods, SMAC performed better on
8 of 10 datasets, with 2 ties. SMAC's 20% to 80% quantile range based on bootstrap sampling (picking
4 runs at random out of a pool of pre-computed runs to simulate different parallel runs) is also entirely
below the same quantile range of Hoeffding races for six datasets, demonstrating the robustness of SMAC's
performance. Finally, Figure 1 shows that the two Bayesian optimization approaches also achieved the best
training set performance over time, with SMAC performing somewhat better than TPE. Since Hoeffding
races - like SMAC - can also handle large numbers of training/validation splits, we also performed preliminary
experiments with H-Race using repeated random sub-sampling validation. These experiments indicate that
the performance of H-Race indeed improves, outperforming SMAC in two cases (GCCD with 27.0% and
KR vs. KP with 0.31%), and tying in 3 (Iris, WBCD, and Yeast).
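
The bootstrap procedure used to simulate parallel runs can be sketched as follows. Several details are assumptions of this illustration, since the text does not spell them out: runs are drawn with replacement, the run with the lowest cross-validation error in each draw contributes its test error, and quantiles are read off the sorted list by index.

```python
import random

def simulate_parallel_runs(runs, num_parallel, num_bootstrap, rng):
    """Return (median, 20% quantile, 80% quantile) of the simulated test error.

    runs is a list of (cv_error, test_error) pairs from pre-computed runs.
    """
    test_errors = []
    for _ in range(num_bootstrap):
        # Assumption: the simulated parallel runs are drawn with replacement.
        sample = [rng.choice(runs) for _ in range(num_parallel)]
        # The run with the lowest cross-validation error is the one returned.
        best = min(sample, key=lambda run: run[0])
        test_errors.append(best[1])
    test_errors.sort()

    def quantile(q):
        return test_errors[min(len(test_errors) - 1, int(q * len(test_errors)))]

    return quantile(0.5), quantile(0.2), quantile(0.8)
```

With 25 pre-computed runs, `simulate_parallel_runs(runs, 4, 1000, rng)` mirrors the setting of 4 parallel jobs and 1 000 bootstrap samples described in the text.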

Looking at the classifiers picked by each of the methods for every dataset, TPE and SMAC have a tendency 
to pick among the same set of classifiers for each dataset. Qualitatively, the most common methods selected 
are the ensemble methods, with Random Forests selected a close second. 

Table 4: Datasets used for additional experiments: Num. Discrete and Num. Continuous refer to the number
of discrete and continuous attributes of elements in the dataset.

Name        | Num. Discrete | Num. Continuous | Classes | Size
------------+---------------+-----------------+---------+-----
Arcene      |               | 10 000          | 2       | 900
Arrhythmia  |               | 101             | 17      | 452
Hill-Valley |               | 601             | 2       | 606
Secom       |               | 591             | 2       | 1567
Semeion     | 256           |                 | 10      | 1593



Table 5: Performance on testing data for higher-dimensional data. We performed 25 runs of each of
Hoeffding races, SMAC, and TPE, and report results as median (20% quantile, 80% quantile) percent error
rate across 1 000 bootstrap samples. Ex-Def is deterministic. The error rate is determined by training the
selected model/hyper-parameters on the entire 70% of training data, then computing the accuracy on the
previously unused 30% of testing data.

Dataset     | Ex-Def-10 (%) | H-Race-10 (%)        | TPE-10 (%)           | SMAC-RRSSV (%)
------------+---------------+----------------------+----------------------+--------------------
Arcene      | 8.33          | 8.33 (8.33, 8.33)    | 8.33 (6.67, 16.60)   | 13.33 (8.33, 34.33)
Arrhythmia  | 33.33         | 33.33 (33.33, 33.33) | 33.33 (32.59, 41.4)  | 15.00 (8.33, 34.81)
Hill-Valley | 7.73          | 7.73 (7.73, 7.73)    | 0 (0, 0)             | 0 (0, 0)
Secom       | 8.09          | 8.09 (8.09, 8.09)    | 8.09 (8.09, 8.09)    | 7.87 (7.87, 7.87)
Semeion     | 8.18          | 8.18 (8.18, 8.18)    | 7.34 (6.91, 8.34)    | 5.87 (5.45, 5.87)


4.3 Additional experiments on high-dimensional data 

Additionally, experiments were performed on datasets with higher dimensionality, all acquired from the
UCI repository (Table 4). However, these experiments used a modified version of Auto-WEKA. After the
learning algorithm has exhausted its CPU budget for training, an interrupt is sent to the learning algorithm 
to terminate as soon as possible. Thus, models that are not fully trained (to convergence) may be used for 
evaluation on the validation/testing sets. This is in contrast to the previous version, which reports an error 
rate of 100% for models that do not finish training in their CPU time budget. 

Table 5 summarizes the results using all methods. As before, SMAC and TPE both give performance better
than or equal to Hoeffding races and Ex-Def. On the Hill-Valley dataset, both TPE and SMAC were able
to consistently find hyper-parameters that allowed for models with 0% error on the unseen testing set. On
the Secom and Semeion datasets, the 20% to 80% quantile range of SMAC-RRSSV's bootstrap sampling
falls completely below that of the other methods. On the Arcene dataset, while SMAC-RRSSV is unable to
achieve performance equal to or better than the default, SMAC-CV10 consistently finds hyper-parameters
with an error rate of 8.33% (which is equal to the error rate of Ex-Def).

Unlike the previous set of experiments, here the most often picked models are meta-classifiers, frequently 
relying on network based models as their base classifier. 

We also conducted preliminary experiments on additional large datasets (again from the UCI repository)
with up to 100 000 attributes and 125 000 elements. For these experiments, we had to increase the budget
for each round of training to 3 600 seconds of CPU time and 4 GB of RAM, and the overall optimization
time to 100 000 seconds. These experiments produced results qualitatively similar to those above; in one
case with 10 000 attributes (the Amazon Commerce reviews set), Auto-WEKA achieved a relative
performance improvement of 7% over Ex-Def.



5 Conclusion and future work



In this work, we have shown that the daunting problem of combined algorithm selection and hyper-parameter
optimization can be solved well enough to give rise to a practical, fully automated tool for choosing a
machine learning algorithm from a large set of candidates, such as the full range of classification algorithms
in WEKA, and determining good hyper-parameter settings for a given use context. The best results are
achieved using two recent sequential model-based optimization techniques that use predictive models to
determine which algorithms and hyper-parameter configurations are evaluated. The Auto-WEKA tool we
have built, combined with one such technique, SMAC [6], makes it easy for non-experts to find the best
classification algorithm within WEKA along with a good hyper-parameter configuration for a given application
scenario, with little human time and within a reasonable amount of (fully automated) computation.

We see several promising avenues for future work. First, we see potential value in extending our current
approach to allow parameter sharing between classifiers used within ensemble methods. Second, we could use
our approach as an inner loop for training ensembles of machine learning algorithms by iteratively adding
algorithms with maximal marginal contribution (this idea is conceptually related to the Hydra approach for
constructing algorithm selectors [14]).
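
The second idea, iteratively adding the algorithm with maximal marginal contribution, can be sketched as
greedy forward selection. This is an illustration under assumptions, not an implemented part of
Auto-WEKA: a small fixed pool of candidates stands in for the inner optimization loop, ensemble members
are combined by averaging predicted probabilities, and all names and numbers below are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def greedy_ensemble(
    candidates: Sequence[str],
    ensemble_error: Callable[[List[str]], float],
    max_size: int,
) -> Tuple[List[str], float]:
    """Grow an ensemble by repeatedly adding the candidate that lowers the
    validation error the most; stop when no candidate improves it."""
    chosen: List[str] = []
    best_error = float("inf")
    while len(chosen) < max_size:
        remaining = [c for c in candidates if c not in chosen]
        if not remaining:
            break
        errors = {c: ensemble_error(chosen + [c]) for c in remaining}
        winner = min(remaining, key=errors.__getitem__)
        if errors[winner] >= best_error:
            break  # no remaining candidate has a positive marginal contribution
        chosen.append(winner)
        best_error = errors[winner]
    return chosen, best_error


# Made-up validation labels and predicted probabilities of class 1.
LABELS = [1, 1, 0, 0]
PROBS: Dict[str, List[float]] = {
    "a": [0.9, 0.4, 0.1, 0.2],
    "b": [0.6, 0.9, 0.3, 0.6],
    "c": [0.2, 0.3, 0.8, 0.1],
}


def averaged_error(members: List[str]) -> float:
    """Fraction of validation examples misclassified by the averaged model."""
    wrong = 0
    for i, label in enumerate(LABELS):
        mean_prob = sum(PROBS[m][i] for m in members) / len(members)
        wrong += int((mean_prob > 0.5) != bool(label))
    return wrong / len(LABELS)


chosen, error = greedy_ensemble(list(PROBS), averaged_error, max_size=3)
```

On this toy data, candidates "a" and "b" each misclassify one example on their own, but their averaged
predictions are correct on all four, so the greedy loop selects both and then stops because adding "c"
brings no further improvement.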



References 

[1] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I.H. Witten. The WEKA data mining 
software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10-18, 2009. 

[2] T. Schaul, J. Bayer, D. Wierstra, Y. Sun, M. Felder, F. Sehnke, T. Rückstieß, and J. Schmidhuber.
PyBrain. Journal of Machine Learning Research, 2010.

[3] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine 
Learning Research, 13:281-305, 2012. 

[4] O. Maron and A. Moore. Hoeffding races: Accelerating model selection search for classification and 
function approximation. In Advances in Neural Information Processing Systems, volume 6, pages 
59-66, April 1994. 

[5] E. Brochu, V. M. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost 
functions, with application to active user modeling and hierarchical reinforcement learning. Technical 
Report UBC TR-2009-23 and arXiv:1012.2599v1, Department of Computer Science, University of
British Columbia, 2009. 

[6] F. Hutter, H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm 
configuration. Learning and Intelligent Optimization, pages 507-523, 2011. 

[7] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for Hyper-Parameter Optimization. In
25th Annual Conference on Neural Information Processing Systems (NIPS 2011), 2011.

[8] P. B. Brazdil, C. Soares, and J. P. Da Costa. Ranking learning algorithms: Using IBL and meta-learning
on accuracy and time results. Machine Learning, 50(3):251-277, 2003.

[9] F. Hutter, H. H. Hoos, K. Leyton-Brown, and T. Stützle. ParamILS: an automatic algorithm
configuration framework. Journal of Artificial Intelligence Research, 36(1):267-306, 2009.

[10] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black box 
functions. Journal of Global Optimization, 13:455-492, 1998.

[11] A. Frank and A. Asuncion. UCI machine learning repository, 2010. University of California, Irvine,
School of Information and Computer Sciences. URL: http://archive.ics.uci.edu/ml



[12] J. Snoek, H. Larochelle, and R. P. Adams. Opportunity cost in Bayesian optimization. In NIPS
Workshop on Bayesian Optimization, Sequential Experimental Design, and Bandits, 2011. Published
online.

[13] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Parallel algorithm configuration. In Proc. of LION-6,
2012. To appear.

[14] L. Xu, H. H. Hoos, and K. Leyton-Brown. Hydra: Automatically configuring algorithms for
portfolio-based selection. In Proc. of AAAI-10, pages 210-216, 2010.


